admin-coderabbit
Owner

@admin-coderabbit admin-coderabbit commented Feb 4, 2026

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

  • New Features

    • Improved incident occurrence tracking for monitor failures with detailed failure metadata, environment context, and human-readable failure summaries.
    • Enhanced failure reason generation with pluralization support and contextual information for better incident visibility.
  • Refactor

    • Streamlined monitor failure handling by consolidating internal orchestration logic into a unified incident threshold evaluation process, improving code maintainability.

…0528)

Since we'll be doing more with issue occurrences, split out the concept
of incidents into its own logic module, and incident_occurrence
into its own module

Part of GH-80527
@coderabbit-eval

coderabbit-eval bot commented Feb 4, 2026

📝 Walkthrough


The changes introduce incident occurrence publishing for monitor failures. A new module creates and emits incident occurrences with failure metadata and evidence, while a new function evaluates monitor environments for incident state transitions based on failure thresholds. The existing mark_failed function is refactored to delegate to this new incident threshold logic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Incident Occurrence Publishing: src/sentry/monitors/logic/incident_occurrence.py | New module introducing create_incident_occurrence() to construct and emit IssueOccurrence objects to Kafka with failure metadata, evidence_display (reason, environment, last check-in), and contexts. Includes helper functions get_failure_reason() for human-readable failure summaries and get_monitor_environment_context() for context dictionaries, plus mappings for localized failure status strings. |
| Incident Threshold Logic: src/sentry/monitors/logic/incidents.py | New function try_incident_threshold() to evaluate monitor environments for incident state transitions based on failure thresholds. Handles environment status evaluation, check-in sequence retrieval, incident creation/retrieval, and delegates to create_incident_occurrence() for each failure when appropriate. Emits monitor_environment_failed signal. |
| Mark Failed Refactoring: src/sentry/monitors/logic/mark_failed.py | Simplified to delegate incident orchestration to try_incident_threshold(), removing ~260 lines of internal threshold computation, incident management, and occurrence publishing logic. Function now returns boolean result from threshold evaluation. |
| Type Definitions: src/sentry/monitors/types.py | Added new TypedDict SimpleCheckIn with fields for id, date_added, and status to represent minimal check-in objects. |
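
For orientation, a TypedDict of the shape described in the last row might look like the sketch below. The three field names come from the summary above; the field types (int, datetime, int) are assumptions and are not confirmed anywhere on this page.

```python
from datetime import datetime, timezone
from typing import TypedDict


class SimpleCheckIn(TypedDict):
    """Minimal check-in representation (field types are assumed)."""

    id: int
    date_added: datetime
    status: int  # assumed to hold a CheckInStatus value


# A TypedDict is a plain dict at runtime; the annotations only guide type checkers.
example: SimpleCheckIn = {
    "id": 1,
    "date_added": datetime(2026, 2, 4, tzinfo=timezone.utc),
    "status": 0,
}
```

Because the value is an ordinary dict, the subscript access reviewed below (checkin["status"]) works without any conversion.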

Sequence Diagram

sequenceDiagram
    participant Check as Failed Check-in
    participant MarkFailed as mark_failed()
    participant Threshold as try_incident_threshold()
    participant Store as Monitor Environment<br/>(Database)
    participant Incident as Incident<br/>(Database)
    participant Occurrence as create_incident_occurrence()
    participant Kafka as Kafka

    Check->>MarkFailed: failed_checkin, ts, received
    MarkFailed->>Threshold: failed_checkin, failure_issue_threshold, received
    
    Threshold->>Store: Get monitor_environment
    alt monitor_env exists and status is OK/ACTIVE
        Threshold->>Store: Fetch previous check-ins up to threshold
        Threshold->>Store: Set status to ERROR, save
        Threshold->>Incident: get_or_create incident
        Incident-->>Threshold: incident instance
    else status is ERROR
        Threshold->>Incident: Retrieve active incident
    end
    
    alt Incident exists and not muted
        Threshold->>Occurrence: create_incident_occurrence(failed_checkins, checkin, incident, received)
        Occurrence->>Kafka: produce_occurrence_to_kafka(IssueOccurrence)
        Kafka-->>Occurrence: ✓ Published
    end
    
    Threshold-->>MarkFailed: boolean result
    MarkFailed-->>Check: True/False
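
The branching in the diagram can be modelled in a few lines of plain Python. This is a simplified sketch, not the code in incidents.py: Status, MonitorEnv, the string check-in statuses, and the published list (standing in for Kafka) are all hypothetical, and returning True for a muted environment is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OK = "ok"
    ACTIVE = "active"
    ERROR = "error"
    DISABLED = "disabled"


@dataclass
class MonitorEnv:
    """Hypothetical stand-in for a monitor environment row."""

    status: Status = Status.OK
    muted: bool = False
    incident_open: bool = False
    published: list = field(default_factory=list)  # stands in for Kafka


def try_incident_threshold(env: MonitorEnv, recent: list[str], threshold: int) -> bool:
    """recent holds check-in statuses, oldest first, ending with the failing one."""
    if env.status in (Status.OK, Status.ACTIVE):
        window = recent[-threshold:]
        # Any successful check-in inside the window means the threshold is not met.
        if any(status == "ok" for status in window):
            return False
        env.status = Status.ERROR
        env.incident_open = True  # models "get_or_create incident"
        failed = window
    elif env.status is Status.ERROR:
        # Already failing: only the newest check-in is reported.
        failed = recent[-1:]
    else:
        return False

    if env.incident_open and not env.muted:
        env.published.extend(failed)  # one occurrence per failed check-in
    return True
```

With a threshold of 2, the history ["ok", "error"] leaves the environment untouched, while ["error", "missed"] flips it to ERROR and publishes both failures.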

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 The monitors now speak with care,
Each failure captured in the air,
Through Kafka's stream, occurrences flow,
Incidents tracked from high to low,
A refactored dance, both lean and bright! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title refers to a test automation process but lacks specificity about the actual code changes; it does not clearly describe the refactoring of monitor incident logic. | Revise the title to describe the primary change, such as 'Refactor monitor incident handling into dedicated modules' or 'Extract incident occurrence logic into separate module'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch monitor-incident-refactor-after

Comment @coderabbit eval help to get the list of available commands and usage tips.

@coderabbit-eval coderabbit-eval bot closed this Feb 4, 2026

@coderabbit-eval coderabbit-eval bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/sentry/monitors/logic/incident_occurrence.py`:
- Around line 159-171: In get_monitor_environment_context(MonitorEnvironment),
the function copies monitor.config to local variable config and mutates
schedule_type but then returns monitor_environment.monitor.config (the
original); fix by returning the modified local config (or assign the modified
copy back) in the returned dict under the "config" key so the schedule_type
display change is preserved.
- Around line 130-156: Fix three issues in get_failure_reason: correct the
docstring typo to "human readable"; change the membership check when building
status_counts to "if checkin['status'] in HUMAN_FAILURE_STATUS_MAP" (remove
.keys()); and replace list(status_counts.keys())[0] with
next(iter(status_counts.keys())) (or next(iter(status_counts))) when selecting
the single status for SINGULAR_HUMAN_FAILURE_MAP to avoid creating a full list.
🧹 Nitpick comments (1)
src/sentry/monitors/logic/incidents.py (1)

36-53: Unnecessary list materialization inside any().

Line 52 uses any([...]) with a list comprehension. A generator expression (any(... for ...)) short-circuits without allocating the full list.

Proposed fix
-            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+            if any(checkin["status"] == CheckInStatus.OK for checkin in previous_checkins):
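
The difference is easy to observe in isolation. In the sketch below, is_ok and the string statuses are made up for illustration; the helper records every comparison so the two forms can be compared by how many elements they evaluate.

```python
calls = []


def is_ok(status):
    """Record each status that gets evaluated, then compare it."""
    calls.append(status)
    return status == "ok"


statuses = ["ok", "error", "missed"]

# List form: the comprehension evaluates every element before any() runs.
calls.clear()
list_result = any([is_ok(s) for s in statuses])
list_calls = len(calls)

# Generator form: any() stops at the first truthy element.
calls.clear()
gen_result = any(is_ok(s) for s in statuses)
gen_calls = len(calls)
```

Both forms return the same boolean; only the amount of work differs, which matters little for a handful of check-ins but costs nothing to fix.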

Comment on lines +130 to +156
def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
    """
    Builds a humam readible string from a list of failed check-ins.

    "3 missed check-ins detected"
    "2 missed check-ins, 1 timeout check-in and 1 error check-in were detected"
    "A failed check-in was detected"
    """

    status_counts = Counter(
        checkin["status"]
        for checkin in failed_checkins
        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
    )

    if sum(status_counts.values()) == 1:
        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]

    human_status = get_text_list(
        [
            "%(count)d %(status)s" % {"count": count, "status": HUMAN_FAILURE_STATUS_MAP[status]}
            for status, count in status_counts.items()
        ],
        last_word=_("and"),
    )

    return _("%(problem_checkins)s check-ins detected") % {"problem_checkins": human_status}

⚠️ Potential issue | 🟡 Minor

Minor issues: typo in docstring and static analysis findings.

  1. Line 132: Typo — "humam readible" → "human readable".
  2. Line 142: Idiomatic Python — in HUMAN_FAILURE_STATUS_MAP suffices (.keys() is implicit).
  3. Line 146: Per Ruff RUF015, prefer next(iter(status_counts.keys())) over list(status_counts.keys())[0] to avoid materializing the full list for a single element.
Proposed fixes
-    """
-    Builds a humam readible string from a list of failed check-ins.
+    """
+    Builds a human readable string from a list of failed check-ins.
-        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()
+        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP
-        return SINGULAR_HUMAN_FAILURE_MAP[list(status_counts.keys())[0]]
+        return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts))]
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 146-146: Prefer next(iter(status_counts.keys())) over single element slice

Replace with next(iter(status_counts.keys()))

(RUF015)

🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 130 - 156, Fix
three issues in get_failure_reason: correct the docstring typo to "human
readable"; change the membership check when building status_counts to "if
checkin['status'] in HUMAN_FAILURE_STATUS_MAP" (remove .keys()); and replace
list(status_counts.keys())[0] with next(iter(status_counts.keys())) (or
next(iter(status_counts))) when selecting the single status for
SINGULAR_HUMAN_FAILURE_MAP to avoid creating a full list.
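
Applied together, the three fixes give a function along the lines of the sketch below. It is written to run without Django, so several pieces are assumptions: the contents of both mappings, the string status keys, and text_list, which is a hypothetical stand-in for get_text_list. Translation via _() is omitted.

```python
from collections import Counter

# Assumed mappings; the real ones are not shown on this page.
HUMAN_FAILURE_STATUS_MAP = {
    "error": "failed",
    "missed": "missed",
    "timeout": "timed out",
}
SINGULAR_HUMAN_FAILURE_MAP = {
    "error": "A failed check-in was detected",
    "missed": "A missed check-in was detected",
    "timeout": "A timed out check-in was detected",
}


def text_list(items, last_word="and"):
    """Hypothetical stand-in for get_text_list: joins as 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return f"{', '.join(items[:-1])} {last_word} {items[-1]}"


def get_failure_reason(failed_checkins):
    """Builds a human readable string from a list of failed check-ins."""
    status_counts = Counter(
        checkin["status"]
        for checkin in failed_checkins
        if checkin["status"] in HUMAN_FAILURE_STATUS_MAP  # no .keys() needed
    )

    if sum(status_counts.values()) == 1:
        # next(iter(...)) reads the only key without building a list.
        return SINGULAR_HUMAN_FAILURE_MAP[next(iter(status_counts))]

    human_status = text_list(
        [
            f"{count} {HUMAN_FAILURE_STATUS_MAP[status]}"
            for status, count in status_counts.items()
        ]
    )
    return f"{human_status} check-ins detected"
```

A Counter keeps keys in first-seen order, so the summary lists statuses in the order they first appear in failed_checkins.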

Comment on lines +159 to +171
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
    config = monitor_environment.monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

    return {
        "id": str(monitor_environment.monitor.guid),
        "slug": str(monitor_environment.monitor.slug),
        "name": monitor_environment.monitor.name,
        "config": monitor_environment.monitor.config,
        "status": monitor_environment.get_status_display(),
        "type": monitor_environment.monitor.get_type_display(),
    }

⚠️ Potential issue | 🟠 Major

Bug: modified config copy is never used — original config is returned instead.

Line 160 creates a local copy of config and line 162 mutates schedule_type on that copy, but line 168 returns monitor_environment.monitor.config (the original, unmodified dict). The schedule_type display transformation is silently discarded.

🐛 Proposed fix
     return {
         "id": str(monitor_environment.monitor.guid),
         "slug": str(monitor_environment.monitor.slug),
         "name": monitor_environment.monitor.name,
-        "config": monitor_environment.monitor.config,
+        "config": config,
         "status": monitor_environment.get_status_display(),
         "type": monitor_environment.monitor.get_type_display(),
     }
🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 159 - 171, In
get_monitor_environment_context(MonitorEnvironment), the function copies
monitor.config to local variable config and mutates schedule_type but then
returns monitor_environment.monitor.config (the original); fix by returning the
modified local config (or assign the modified copy back) in the returned dict
under the "config" key so the schedule_type display change is preserved.
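
The effect can be reproduced without the Sentry models. In this sketch FakeMonitor, its config values, and the "crontab" display string are all hypothetical; only the copy-then-return pattern mirrors the reviewed function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMonitor:
    """Hypothetical stand-in for the Monitor model."""

    config: dict = field(
        default_factory=lambda: {"schedule_type": 1, "schedule": "0 * * * *"}
    )

    def get_schedule_type_display(self):
        return "crontab"  # assumed display value


def context_buggy(monitor):
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    # Returns the untouched original, so the display value is lost.
    return {"config": monitor.config}


def context_fixed(monitor):
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    # Returns the modified copy; the stored config stays unchanged.
    return {"config": config}


monitor = FakeMonitor()
buggy = context_buggy(monitor)
fixed = context_fixed(monitor)
```

Returning the copy keeps the display string in the context while leaving monitor.config itself unmodified, which is presumably why the copy was made in the first place.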
